The Annals of Applied Statistics
● Institute of Mathematical Statistics
Preprints posted in the last 90 days, ranked by how well they match The Annals of Applied Statistics's content profile, based on 15 papers previously published here. The average preprint has a 0.00% match score for this journal, so anything above that is already an above-average fit.
Gill, P.; Bleka, O.
Show abstract
The interpretation of trace DNA evidence at activity level requires explicit modelling of transfer, persistence, and failure to detect a person of interest. We present the theoretical foundations of HaloGen, an open-source hierarchical Bayesian framework for evaluating biological results under competing activity-level propositions, such as direct versus secondary transfer. HaloGen accounts for dropout, multiple contributors, and multiple stains. Evidence is evaluated using an exhaustive-propositions likelihood ratio frame-work that combines information across contributors and stains, while fully accounting for uncertainty in transfer and detection. Observed DNA quantities and non-detects are handled consistently within a single probabilistic model, avoiding reliance on fixed parameter estimates. The framework yields intuitive and robust behaviour: strong support for direct transfer when DNA quantities are informative, and appropriately neutral or defence-leaning likelihood ratios in low-information or non-detect scenarios. An empirically constrained fail-rate parameter prevents spurious inflation of likelihood ratios when offender detection is unlikely, providing stability across laboratories and experimental conditions. This paper establishes the theoretical basis of HaloGen; a companion paper addresses validation and applied casework examples.
Mateusiak, C.; Jia, E.; Plaggenberg, J. N.; Erdenebaatar, Z.; Wang, Y.; Shively, C.; Liao, G.; Mitra, R. D.; Brent, M. R.
Show abstract
Most genes in whose promotor a transcription factor (TF) binds do not change in expression when the concentration of the TF is perturbed. No existing model can predict which bound promotors will respond and which will not. We hypothesized that a genes response to perturbation of a TF bound in its promotor can depend on which other TFs are bound there, a phenomenon we call functional synergy. This is distinct from cooperative binding, which is already accounted for in the binding location data. To investigate functional synergy, we created a comprehensive dataset on TF binding locations in yeast using a method that is orthogonal to chromatin immunoprecipitation. We then used mathematical modeling to identify high-confidence instances of functional synergy. We found that such synergies are surprisingly common. Responses to perturbations of 44 different TFs were modified by the presence of other TFs. 48 TFs served as modifiers, but some modified responses to many TFs. We conclude that (1) measuring the binding locations of a single TF will not, in general, reveal which genes the TF regulates, and (2) traditional networks linking TFs to their targets must be made substantially more expressive, allowing some TFs to modify the effects of others.
Yelmen, B.; Güler, M. N.; Estonian Biobank Research Team, ; Kollo, T.; Möls, M.; Charpiat, G.; Jay, F.
Show abstract
Over the past two decades, genome-wide association studies (GWAS) enabled the discovery of thousands of variants associated with many complex human traits. However, conventional GWAS are still widely performed with linear models with the assumption that the genetic effects are predominantly additive. In this work, we investigate the test statistic behavior when linear models are used to obtain significant genotype-phenotype associations without accounting for epistasis. We first algebraically derive mean and variance shift in the null statistic due to the omitted interaction term, and define the boundary between conservative (i.e., deflated statistic tail) and anti-conservative (i.e., inflated statistic tail) regimes for the common GWAS significance threshold. We then perform phenotype simulation analyses using the Estonian Biobank genotypes and validate the mathematical model. We demonstrate that the anti-conservative regime is plausible under realistic parameter settings and models omitting interaction terms can produce spurious significance. Our findings suggest caution when interpreting statistically significant signals reported in the literature based on linear models, especially for large-scale GWAS.
Kesimoglu, Z. N.; Hodzic, E.; Hoinka, J.; Amgalan, B.; Hirsch, M. G.; Przytycka, T. M.
Show abstract
Mutational signatures represent characteristic mutational patterns imprinted on the genome by mutagenic processes. They can provide information about the impact of the environmental and endogenous cellular processes on tumor mutations and can suggest treatment. Analysis of presence and strength of mutational signatures in cancer genomes has become a cornerstone in analysis of new and legacy cancer data. However, a precise inference of novel (de novo) signatures requires a large set of genomes, and methods focusing on estimating the presence of previously defined signatures are unable to uncover potential novel signatures that might emerge in new data. Thus, reliable methods to address these challenges are needed. We formally define the Combined Mutational Signature Inference Problem (CMSI) for the identification of known signatures and the inference of novel signatures in cancer data. CMSI represents non-convex optimization, and we provide a heuristic algorithm, ReDeNovo, to solve it efficiently. We extensively validated ReDeNovo on simulated data, evaluating its ability to precisely estimate presence and exposure to known signatures and to discover of novel signatures. On both tasks ReDeNovo outperformed existing approaches. In real biological data, ReDeNovo identified signatures missed by previous analyses and defined a new signature related to UV light exposure. ReDeNovo method provides a new and powerful tool to uncover mutational signatures. ReDeNovo is available from https://github.com/ncbi/redenovo.
Sosa, S.; Brooke McElreath, M.; Ross, C.
Show abstract
O_LIBayesian modeling is a powerful paradigm in modern statistics and machine learning. However, practitioners face significant obstacles in building bespoke models. C_LIO_LIThe landscape of Bayesian software is fragmented across programming languages and abstraction levels. Newcomers often gravitate towards high-level interfaces, like R, in order to use simple generalized linear models (GLMs) through interfaces like brms. C_LIO_LIFor niche problems, researchers must often transition to writing directly in lower-level programming languages, like Stan or JAX, which require specialist knowledge. C_LIO_LIFurthermore, computational demands remain a significant bottleneck, often limiting the feasibility of applying Bayesian methods on large datasets and complex, high-dimensional models. C_LIO_LIThe Bayesian Inference (BI) is a cross-platform software distributed as a Python, R and Julia library. It provides an intuitive model-building syntax with the flexibility of low-level abstraction coding, while also providing pre-built GLM functions. Further, by facilitating hardware-accelerated GPU computation under-the-hood, BI permits high-dimensional models to be fit in a fraction of the time of comparable Stan models (up to 200-fold). C_LI
Melton, H. J.; Bradley, J. R.; Wu, C.
Show abstract
Oral squamous cell carcinomas (OSCC), the predominant head and neck cancer, pose significant challenges due to late-stage diagnoses and low five-year survival rates. Spatial transcriptomics offers a promising avenue to decipher the genetic intricacies of OSCC tumor microenvironments. In spatial transcriptomics, Cell-type deconvolution is a crucial inferential goal; however, current methods fail to consider the high zero-inflation present in OSCC data. To address this, we develop a novel zero-inflated version of the hierarchical generalized transformation model (ZI-HGT) and apply it to the Conditional AutoRegressive Deconvolution (CARD) for cell-type deconvolution. The ZI-HGT serves as an auxiliary Bayesian technique for CARD, reconciling the highly zero-inflated OSCC spatial transcriptomics data with CARDs normality assumption. The combined ZI-HGT + CARD framework achieves enhanced cell-type deconvolution accuracy and quantifies uncertainty in the estimated cell-type proportions. We demonstrate the superior performance through simulations and analysis of the OSCC data. Furthermore, our approach enables the determination of the locations of the diverse fibroblast population in the tumor microenvironment, critical for understanding tumor growth and immunosuppression in OSCC.
Hoyt, S. H.; Reddy, T. E.; Gordan, R.; Allen, A. S.; Majoros, W. H.
Show abstract
Interpreting the effects of novel mutations on phenotypic traits remains challenging, particularly for cis-regulatory variants. For rare variants, individuals typically possess at most one affected copy of the causal allele, leading to allelic imbalance, and thus the ability to infer inheritance of allelic imbalance can inform genetic studies of phenotypic traits. While many methods for detection of allele-specific expression (ASE) exist, they largely focus on ASE in one individual. We show that performing joint inference across multiple individuals in a trio allows for simultaneously improving estimates of ASE and identifying its likely mode of inheritance. Our Bayesian approach has the benefit of being able to (1) aggregate information across individuals so as to improve statistical power, (2) estimate uncertainty in estimates, and (3) rank modes of inheritance by posterior probability. We demonstrate that this model is also applicable to other forms of imbalance such as allele-specific chromatin accessibility. Applying the model to ATAC-seq and RNA-seq from several trios, we uncover examples in which ASE can be linked to imbalance in chromatin state of cis-regulatory elements and to potential causal variants. As the cost of sequencing continues to decrease, we expect that powerful methodologies such as the one presented here will promote more routine collection of samples from related individuals and improve our understanding of genetic effects on gene regulation and their contribution to phenotypic traits.
Frost, H. R.
Show abstract
We describe an approach for analyzing biological networks using rows of the Krylov subspace of the adjacency matrix. Specifically, we explore the scenario where the Krylov subspace matrix is computed via power iteration using a non-random and potentially non-uniform initial vector that captures a specific biological state or perturbation. In this case, the rows the Krylov subspace matrix (i.e., Krylov trajectories) carry important functional information about the network nodes in the biological context represented by the initial vector. We demonstrate the utility of this approach for community detection and perturbation analysis using the C. Elegans neural network.
Kornilov, S. A.
Show abstract
Shenhar et al. (2026) report 50% "intrinsic" lifespan heritability after calibrating a one-component correlated-frailty survival model to Scandinavian twin lifespans. Their framework is mathematically coherent, but the intrinsic component is not identified if heritable, mortality-relevant extrinsic susceptibility is omitted at calibration. We show that one-component calibration absorbs omitted familial extrinsic structure into the intrinsic frailty scale parameter{sigma}{theta} , and that this variance absorption is visible through separate diagnostics (1) Variance absorption. Under misspecification,{sigma}{theta} is inflated by +22.1% (95% CI: 21.5-22.7%), corresponding to +49% inflation in [Formula]. Falconer h2 is downstream of calibration and inherits a +9.2 pp bias (95% CI: 8.7-9.7). The{sigma}{theta} inflation is model-general: +22% (GM), +18% (MGG), +14% (SR); any dependence summary that is strictly increasing in{sigma}{theta} inherits this inflation, so Falconer h2 is one affected downstream quantity among many (Corollary B3). (2) Structural fingerprint. In the joint twin survival surface S(t1, t2), misspecification produces systematic dependence errors (ISE 48x that of the recovery model). Conditional twin dependence is inflated at all ages, peaking at age 80 ({Delta}r = 0.048). (3) Specificity. The bias requires an omitted component that is both heritable and mortality-relevant. Three negative controls, a boundary check ({rho} = 0), and a two-component recovery refit ({sigma}{theta} restored to within -3.2%) establish specificity. ACE decomposition yields C {approx} 0 throughout: the omitted extrinsic component loads onto A (because it is shared 1.0/0.5 in MZ/DZ), so switching summary statistics does not restore identification. (4) Sensitivity and falsifiability. Over an empirically anchored regime ({sigma}{gamma} [isin] [0.30, 0.65],{rho} [isin] [0.20, 0.50]), Falconer bias ranges from +2.8 to +18.9 pp (mean 9 pp). If{rho} is sufficiently negative, the bias reverses sign in all three model families (Corollary B4). A full-likelihood robustness check shows that this upward pull is partly structural and partly estimator-specific: in the same misspecified one-component model, ML still inflates{sigma}{theta} (+3%), whereas matching only rMZ inflates it much more (+21%). These results do not resolve true intrinsic heritability but establish that Shenhars 50% estimate carries a structured, model-general upward bias originating in the fitted latent variance{sigma}{theta} .
Sato, Y.; Hamazaki, K.
Show abstract
Individual phenotypes often depend on the genotypes of other individuals within a group. These phenomena are termed indirect genetic effects (IGEs) and have been distinguished from direct genetic effects (DGEs) using quantitative genetic models. Recent studies have utilized high-resolution polymorphism data to enable genomic prediction (GP) and genome-wide association study (GWAS) of IGEs, but unified methods remain limited. Here we integrate polygenic and oligogenic IGEs using a multi-kernel mixed model incorporating two random effects with a single covariance parameter. Underlying this implementation, the Ising model of ferromagnetics enabled us to simplify locus-wise and background IGEs for GWAS and GP, respectively. Our simulations demonstrated that, while the previous and present models exhibited similar performance, the present model can infer a trade-off between DGEs and IGEs. By applying this method to three species of woody plants, we found evidence for intergenotypic competition in aspen and apple trees, but limited evidence in climbing grapevines. Based on GWAS, we also detected significant variants associated with the competitive IGEs on the apple trunk growth. Our study offers a flexible implementation for GWAS/GP of IGEs, thereby providing an effective tool to dissect the genetic architecture of group performance.
Wang, R.-Y.; Dinh, K. N.; Taketomi, K.; Pang, G.; King, K. Y.; Kimmel, M.
Show abstract
Clonal hematopoiesis (CH) arises when hematopoietic stem cells (HSCs) gain a fitness advantage from somatic mutations and expand, resulting in an increase in variant allele frequency (VAF) over time. To analyze CH trajectories, we develop a state-dependent stochastic model of wild-type and mutant HSCs, in which an environmental parameter [isin] [0, 1] regulates death rates and interpolates between homeostatic (Moran-like, = 1) and growth-facilitating ( < 1) regimes. Using functional law of large numbers and central limit theorems, we derive explicit mean-field dynamics and a Gaussian-Markov approximation for VAF fluctuations. We show that the mean VAF trajectory has an explicit logistic form determined by selective advantage, while environmental effects affect only the variance and autocovariance structure. Building on these results, we introduce BESTish (Bayesian estimate for selection incorporating scaling-limit to detect mutant heterogeneity), a novel, efficient and accurate Bayesian inference method that can be applied to analyze both cohort-level and longitudinal VAF datasets. BESTish implements the closed-form finite-dimensional distributions that we derive to estimate mutation fitness, mutation rate, and environmental strength for individual CH drivers. When applied to existing CH datasets, BESTish produces consistent mutation fitness inferences across different studies, and estimates CH driver mutation rates in agreement with independent experimental studies. Furthermore, BESTish reveals patient-specific heterogeneity in the selective behavior of recurrent mutations, and identifies variants whose dynamics are compatible with non-homeostatic, growth-facilitating environments. BESTish provides a unified and mechanistic framework for quantifying CH evolution, with potential applications for other biological systems where clonal expansions can be measured.
Hao, H.; Chen, D.; Qian, C.; Zhou, X.; Huang, H.; Zuo, J.; Wang, G.; Peng, X.; Liu, H.-X.
Show abstract
Causal effects in complex traits are typically represented by a single linear slope. While conventional Mendelian randomization (MR) provides efficient scalar estimates, projection-based summaries do not explicitly capture structural organisation in joint effect space under genetic heterogeneity. We introduce MR-UBRA (Mendelian randomization-Unified Bayesian Risk Architecture), a probabilistic framework that decomposes instrumental variants into genetic risk fragments (GRFs) and quantifies extreme deviations using tail-risk metrics defined on the standardised residual magnitude |e|. MR-UBRA preserves the classical MR estimand while offering a structurally resolved representation of genetic heterogeneity. Across stroke subtypes, AF[->]CES, smoking[->]lung cancer, and BMI[->]T2D, effect-space distributions exhibit reproducible asymmetry, amplitude stratification, and multi-modal structure. MR-UBRA resolves component-level organisation, separating tail-dominant contributions from the causal core while maintaining consistency with the classical MR estimand. Simulations and boundary regimes demonstrate adaptive model complexity: MR-UBRA selects K>1 when multi-component structure is present and collapses to K=1 under homogeneous conditions, avoiding spurious stratification. These results support viewing causal effects in complex traits as structured distributions in joint effect space, enhancing causal representation without altering the MR estimand. Graphical AbstractMendelian randomization (MR) typically represents causal effects with a single linear slope. Under genetic heterogeneity, instrumental effects in joint ({beta}X, {beta}Y) space may exhibit multi-component structure and amplitude stratification that cannot be captured by a scalar summary. MR-UBRA fits a standard error-weighted mixture model to decompose instruments into genetic risk fragments (GRFs), estimates GRF-specific effects using posterior-weighted soft-IVW, and quantifies extreme deviations through tail-risk metrics (VaR/CVaR). Across empirical analyses and boundary regimes, MR-UBRA adapts model complexity (K) to structural signal, collapsing to K=1 under homogeneous conditions. O_FIG O_LINKSMALLFIG WIDTH=200 HEIGHT=144 SRC="FIGDIR/small/26348288v1_ufig1.gif" ALT="Figure 1"> View larger version (31K): org.highwire.dtl.DTLVardef@1627086org.highwire.dtl.DTLVardef@1c9982eorg.highwire.dtl.DTLVardef@262730org.highwire.dtl.DTLVardef@d6e551_HPS_FORMAT_FIGEXP M_FIG C_FIG
Sumalinab, B.; Gressani, O.; Hens, N.; Faes, C.
Show abstract
This paper presents a smoothing method to estimate age-specific human contact patterns and their variations over different periods. Specifically, it examines how age-specific contact patterns shift under varying conditions, such as holiday periods and levels of public health intervention. The method uses Bayesian P-splines to smooth age-specific contact rates and leverages Laplace approximations for fast Bayesian inference, significantly reducing computational complexity. The proposed methodology is applied to the CoMix data from Belgium, a social contact survey collected during the COVID-19 pandemic. Results indicate significantly reduced contacts during periods in which strict social policies were in place, particularly among adults, and notable reductions among young individuals during holidays. This research advances our understanding of how human contact adapts in response to varying social and policy conditions, which can guide more realistic and adaptive infectious disease transmission models.
Vloeberghs, R.; Tuerlinckx, F.; Urai, A. E.; Desender, K.
Show abstract
A widely used framework for studying the computational mechanisms of decision making is the Drift Diffusion Model (DDM). To account for the presence of both fast and slow errors in empirical data, the DDM incorporates across-trial variability in parameters such as the drift rate and the starting point. Although these variability parameters enable the model to reproduce both fast and slow errors, they rely on the assumption that over trials each parameter is independently sampled. As a result, the DDM effectively predicts that errors-- whether fast or slow--occur randomly over time. However, in empirical data this assumption is violated, as error responses are often temporally clustered. To address this limitation, we introduce the autocorrelated DDM, in which trial-to-trial fluctuations in drift rate, starting point, and boundary evolve according to first-order autoregressive (AR1) processes. Using simulations, we demonstrate that, unlike the across-trial variability DDM, the autocorrelated DDM naturally accounts for temporal clustering of errors. We further show that model parameters can be reliably recovered using Amortized Bayesian Inference, even with as few as 500 trials. Finally, fits to empirical data indicate that the autocorrelated DDM provides the best account of error clustering, highlighting that computational parameters fluctuate over time, despite typically being estimated as fixed across trials.
Mishra, S.; Patra, R. R.; Reddy, A. S.; Mandal, A.; Majumdar, A.
Show abstract
Genome-wide gene-environment (GxE) interaction studies have seen limited success in detecting reliable GxE signals. A standard genome-wide GxE scan assumes a single genetic mode of inheritance, such as an additive model. It can lead to reduced statistical power when the true genetic model is non-additive, such as a recessive model. We propose a robust GxE testing approach that uses Cauchy p-value aggregation. It combines the p-values from GxE tests based on the additive, dominant, and recessive genetic models. Using extensive simulation studies, we demonstrate that the p-value combination strategy offers a robust and powerful approach to identifying GxE interactions regardless of the underlying genetic model. The method is substantially more powerful than the additive model when the true genetic model is recessive. It is also more powerful than the general two-degree-of-freedom genotypic test for GxE interaction. We apply our approach to analyze GxE interactions in the UK Biobank data across several combinations of phenotypes and environmental factors. For glycated hemoglobin (HbA1c) level, treating cumulative smoking exposure as the lifestyle factor, our approach identified 82 independent GxE loci while controlling FDR at 5%. The GxE test based on the additive genetic model detected 24 loci. For type 2 diabetes with sleep duration as a lifestyle factor, the proposed approach detected 563 independent GxE loci at 5% FDR, substantially exceeding the number of discoveries by the other approaches.
Chang, S.; Zhao, W.; Ma, Y.; Sandstede, B.; Singh, R.
Show abstract
Inferring gene regulatory networks (GRNs) from gene expression is a crucial task for understanding functional relationships. Gene expression data (transcriptomics) provide a snapshot of gene activity, encoding information about gene regulatory relationships. However, gene regulation is a dynamic process, modulating across time and with different cell types. Temporal GRN inference methods aim to capture these dynamics by utilizing time-stamped transcriptomics, gene expression data of similar samples captured across discrete timepoints, or pseudotime transcriptomics, computationally ordering cells based on an inferred trajectory. These methods can estimate constant or temporal gene regulatory relationships, but may not capture finer, cell type specific relationships. We propose ctOTVelo, an extension to our previous work to account for cell type specificity during GRN inference. ctOTVelo incorporates cell type labels or proportions when inferring the GRN from single cell transcriptomics data. Our methods achieve state-of-the-art performance in GRN prediction in time-stamped and pseudotime-stamped transcriptomics. Furthermore, ctOTVelo is able to generate cell type specific GRNs, allowing cell type resolution analysis of gene regulatory relationships.
Easwar, A.; Narayanan, M.
Show abstract
Distinguishing correlation from causation is a fundamental challenge in many scientific fields, including biology, especially when interventions like randomized controlled trials are infeasible and only observational data are available. Methods based on statistical tests of conditional independence within the Mendelian Randomization framework can detect causality between two observed variables that are each associated with a third instrumental variable. However, these methods for detecting causal relationships between traits (e.g., two gene expression or clinical traits associated with a genetic variant, all observed in the same population) often assume a linear relationship, thereby hindering the discovery of causal gene networks from genomics data.We have developed NLCD, a method for NonLinear Causal Discovery from genomics data based on nonlinear regression modeling and conditional feature importance scoring. NLCD uses these techniques to extend the statistical tests in an existing linear causal discovery method called the Causal Inference Test (CIT). We benchmarked NLCD against current state-of-the-art methods: CIT, Findr, and MRPC. On simulated datasets, NLCD performs comparably to most methods in detecting linear relations (Average AUPRC (Area Under the Precision-Recall Curve) of NLCD=0.94, CIT=0.94, Findr=0.94, and MRPC=0.99), and outperforms them in detecting nonlinear (sine and sawtooth type) relations between two genes (Average AUPRC of NLCD=0.76, CIT=0.60, Findr=0.56, and MRPC=0.73). When tested on a nonlinear subset of a yeast genomic dataset to recover known causal relations involving transcription factors, NLCD and CIT performed comparable to each other and slightly better than Findr and MRPC (Average AUPRC of NLCD=0.82, CIT=0.81, Findr=0.71, and MRPC=0.54). On application to a human genomic dataset, NLCD revealed active causal gene pairs (IRF1 [->] PSME1 and HLA-C [->] HLA-T) in the muscle tissue, and clarified the promises and challenges in discovering causal gene networks in tissues under in vivo human settings. AvailabilityThe code implementing our method is available at: https://github.com/BIRDSgroup/NLCD.
Cai, X.; Garcia-Garcia, S.; Kuhnen, L.; Gianniou, M.; Garcia Vallejo, J. J.
Show abstract
Advances in spectral flow cytometry have enabled the simultaneous measurement of dozens of markers across millions of cells within a single experiment. Despite the increasing maximum perplexity achievable in spectral panels, panel design remains constrained. A central obstacle is signal spread-- unmixed fluorescence signal misattributed to unrelated channels--which reduces the resolution of cell populations. Here we introduce the Residual Model, a robust, scalable, and interpretable model-based approach for spread prediction during panel design. The Residual Model integrates statistical features derived from single-color controls and predicts spread under Ordinary Least Squares unmixing, the most widely used unmixing method. We demonstrate its reliable predictive performance across 141 single-color control samples measured on two instruments. To facilitate practical application, we developed the USERM R package, which implements the Residual Model and provides an out-of-box solution for interactive spread prediction and visualization.
Sapoval, N.; Nakhleh, L.
Show abstract
Gene tree parsimony (GTP) is a common approach for efficient reconciliation of multiple discordant gene tree phylogenies for the inference of a single species tree. However, despite the popularity of GTP methods due to their low computational costs, prior work has shown that some commonly employed parsimony costs are statistically inconsistent under the multispecies coalescent process. Furthermore, a fine-grained analysis of the inconsistency has indicated potentially complimentary behavior of duplication and deep coalescence costs for symmetric and asymmetric species trees. In this work, we prove inconsistency of GTP estimators for all linear combinations of duplication, loss and deep coalescence scores. We also explore empirical implications of this result evaluating inference results of several GTP cost schemes under varying levels of incomplete lineage sorting.
Zhang, S.; Lu, Y.; Luo, Q.; An, L.
Show abstract
Identifying cell type-specific expressed genes (marker genes) is essential for understanding the roles and interactions of cell populations within tissues. To achieve this, the traditional differential analysis approaches are often applied to individual cell-type bulk RNA-seq and single-cell RNA-seq data. However, real-world datasets often pose challenges, such as heterogeneous bulk RNA-seq and incomplete scRNA-seq. Heterogeneous bulk RNA-seq amalgamates gene expression profiles from multiple cell types and results in low resolution, while incomplete scRNA-seq does not capture some cell types from the tissue, leading to unknown cell types. Traditional methods fail to identify marker genes for such unknown cell types. MiCBuS addresses this limitation by generating Dirichlet-pseudo-bulk RNA-seq based on bulk and incomplete single-cell RNA-seq data. By performing differential analysis of gene expressions on bulk and Dirichlet-pseudo-bulk RNA-seq samples, MiCBuS can identify the marker genes of unknown cell types, enabling the identification and characterization of these elusive cellular components. Simulation studies and real data analyses demonstrate that MiCBuS reliably and robustly identifies marker genes specific to unknown cell types, a capability that traditional differential analysis methods cannot achieve. Availability and implementationMiCBuS is implemented in the R language and freely available at https://github.com/Shanshan-Zhang/MiCBuS.